Optimizing warfarin dosing for patients with atrial fibrillation using machine learning

While novel oral anticoagulants are increasingly used to reduce risk of stroke in patients with atrial fibrillation, vitamin K antagonists such as warfarin continue to be used extensively for stroke prevention across the world. While effective in reducing the risk of strokes, the complex pharmacodynamics of warfarin make it difficult to use clinically, with many patients experiencing under- and/or over- anticoagulation. In this study we employed a novel implementation of deep reinforcement learning to provide clinical decision support to optimize time in therapeutic International Normalized Ratio (INR) range. We used a novel semi-Markov decision process formulation of the Batch-Constrained deep Q-learning algorithm to develop a reinforcement learning model to dynamically recommend optimal warfarin dosing to achieve INR of 2.0–3.0 for patients with atrial fibrillation. The model was developed using data from 22,502 patients in the warfarin treated groups of the pivotal randomized clinical trials of edoxaban (ENGAGE AF-TIMI 48), apixaban (ARISTOTLE) and rivaroxaban (ROCKET AF). The model was externally validated on data from 5730 warfarin-treated patients in a fourth trial of dabigatran (RE-LY) using multilevel regression models to estimate the relationship between center-level algorithm consistent dosing, time in therapeutic INR range (TTR), and a composite clinical outcome of stroke, systemic embolism or major hemorrhage. External validation showed a positive association between center-level algorithm-consistent dosing and TTR (R2 = 0.56). Each 10% increase in algorithm-consistent dosing at the center level independently predicted a 6.78% improvement in TTR (95% CI 6.29, 7.28; p < 0.001) and a 11% decrease in the composite clinical outcome (HR 0.89; 95% CI 0.81, 1.00; p = 0.015). These results were comparable to those of a rules-based clinical algorithm used for benchmarking, for which each 10% increase in algorithm-consistent dosing independently predicted a 6.10% increase in TTR (95% CI 5.67, 6.54, p < 0.001) and a 10% decrease in the composite outcome (HR 0.90; 95% CI 0.83, 0.98, p = 0.018). Our findings suggest that a deep reinforcement learning algorithm can optimize time in therapeutic range for patients taking warfarin. A digital clinical decision support system to promote algorithm-consistent warfarin dosing could optimize time in therapeutic range and improve clinical outcomes in atrial fibrillation globally.


Exclusion criteria
We began with 29,272 patients in our combined data set.We excluded all patients who had no recorded dose(warfarin)-response(INR) pairs, as well as all patients with a weekly dose of > 140 mg (well beyond the reasonable range and indicative of data entry errors).For the external validation set, we also excluded any patients who were missing covariates required for performance estimation (described below), leaving a total of 28,232 patients.www.nature.com/scientificreports/

Data preprocessing
For all trials, we extracted patients in the warfarin arms.To ensure each warfarin dose recorded would have a corresponding INR measurement, we removed all warfarin doses recorded before a patient's first INR measurement and after their last INR measurement.
We extracted a subset of data from the trials based on clinical expert input about the features most relevant to warfarin dosing decisions.We extracted the following features for inclusion in our study: age, sex, weight, region, smoking status, concomitant medications (aspirin/amiodarone/thienopyridines), diabetes status, history of congestive heart failure, history of hypertension, history of myocardial infarction, current absolute warfarin dose, current INR, previous 4 INRs, whether the patient experienced a stroke during the study period, whether the patient experienced a major bleed during the study period, and whether the patient was hospitalized during the study period.
We discretized some continuous variables, which is known to aid in the learning process.These include age, weight, and warfarin dose, with the resulting categorical variable one-hot encoded with the first level dropped (i.e.we created binary dummy variables for each level of the category).
There were differences across the trials in how warfarin doses were recorded, so we normalized to weekly doses since this is reflective of how warfarin is prescribed in clinical practice.
Patient trajectories were organized as longitudinal sequences of dose (warfarin)-response (INR) pairs.We elected to split patient trajectories in a few scenarios.The first is when a significant adverse event was recorded (ischemic stroke, hemorrhagic stroke, major bleeding, hospitalization), since our RL model is meant to inform the management of atrial fibrillation in the community, not acute stroke or hemorrhage.We also split trajectories in cases where more than 90 days passed between clinical visits, since in these cases patients may be receiving care outside of the clinical trial (due to international travel, for example) and thus we deemed imputation unreliable in this context.Thanks to the rigor of data collection for randomized control trials, these data sets required minimal data cleaning, however some missing or impossible values for warfarin dose and INR were identified.In these cases, we elected to split patient trajectories rather than to impute values for the missing data, because imputing values for either feature could result in cases where our model received incorrect rewards, introducing noise into our learned policy.Given the length of patient trajectories and low level of missingness in this data set, we determined that splitting trajectories in these cases would not negatively impact the learning process.Impossible values (such as INR = 0) were treated as missing.After implementing these trajectory splits, we were left with 42,384 patient trajectories for development and evaluation.

Model development
Our RL model was developed with the Batch Constrained Q-learning algorithm 27 using a formulation designed for discrete action spaces 33 .The Batch Constrained Q-learning algorithm employs double deep Q-learning 34,35 , accompanied by a generative network to constrain the action space to actions previously observed.We selected this algorithm for its robustness in an "offline" setting where there are suboptimal clinical decisions in the observed data set and where complex pharmacodynamics of warfarin introduce a degree of uncertainty in predicting patient response to dose changes.
Reinforcement learning models, including the Batch Constrained Q-learning algorithm, typically structure the interaction between an agent and its environment as a Markov decision process, which assumes that state transitions occur at regularly spaced timesteps.However, in clinical practice warfarin dosing decisions are made at irregular timesteps.To account for this, we model our decision problem as a semi-Markov decision process 26,36 .
In regular clinical practice, warfarin dose decisions are made during a clinic or virtual visit, and that dose is maintained until the next visit, when a new INR result is available.Using the semi-Markov decision process framework, we model these clinical decisions as "options", which are sequences of "primitive actions" which we model as daily dose adjustments.For example, if a clinician prescribes a dose adjustment of "Increase by > 20%" on day 5, and the next visit does not occur until day 10, we model this as an option consisting of the primitive action sequence "Increase > 20%" on day 5, and "no change" on days 6-9.
In this framework, the reward for each option is derived from rewards for each primitive action.We use Rosendaal's linear interpolation method 37 to generate INR values to reward each primitive action in an option (+ 1 for INR in range, 0 for INR out of range), which are used to calculate a cumulative discounted reward for each option (Fig. 1).
Using this framework, the goal of the Batch Constrained Q-learning algorithm is to the learn the policy which maps the current patient state s to the best available option o according to its expected return Q(s, o) .The argmax function returns the option o from the set O(s) with the maximum value of Q(s, o) , where O(s) is the set of options provided by the generative network for state s .The policy π(s) is learned by optimizing the estimate of the expected return via stochastic gradient descent.The learning process was governed by the following hyperparameters: batch size, learning rate, threshold, number of hidden layers, dimensionality of hidden layers, and the moving average parameter for the value estimator.Hyperparameters were optimized using grid search.Using this RL framework, we developed a model using current and previous INR values and warfarin doses as model inputs (ie.patient state).For the secondary objective we developed a second model that made use of all of the clinical features described in the previous section.www.nature.com/scientificreports/

Implementation of the benchmark algorithm and non-optimal policies
Comparing a RL policy against observed clinician behavior can lead to overly optimistic assessments of model performance 33,34 .To address this, we implemented three additional policies to evaluate our RL policy against.The first of these was a clinical algorithm that provides clear warfarin dosing rules based on INR thresholds: if INR is less than 1.5 then increase warfarin dose by 15%, if INR 1.5-1.99 then increase by 10%; if INR 2.0-3.0 then no change, if INR 3.0-4.0then reduce dose by 10%, if INR 4.0-4.99 then hold the dose for one day and then decrease by 10%, and if the INR was greater than 5.00 the dose was to be held until the INR was therapeutic and then decrease by 15%.Use of this type of algorithm has been shown to provide significant improvement in TTR over un-assisted clinician decision making 9,35 , and thus we refer to it throughout as the "benchmark algorithm." We also implemented two intentionally non-optimal policies against which to compare both the benchmark algorithm and our RL policy.One of these is a "no change" policy, which always maintains the current warfarin dose, regardless of changes in INR.The second is a "random" policy, which randomly selects a dose change at each observed time step, with the available options weighted based on their observed frequency in our data set.

Performance evaluation
To evaluate the relationship between algorithm-consistent dosing and TTR, we fit weighted linear regression models at the center level.Center-level algorithm-consistency was calculated by averaging the difference between the observed warfarin dose and the dose recommended under each policy.Center-level TTR was calculated by averaging the values obtained from patients in each center.Models were weighted by the number of patients per center, and were developed with mean center algorithm-consistency as a predictor and mean center TTR as the outcome.A coefficient of determination (R 2 ) was reported for each model.
To estimate the relationship between center-level algorithm-consistency and TTR while controlling for patient, center and country-level factors, we developed multilevel, multivariable linear regression models, with patients nested in centers, and centers nested in countries.Patient-level characteristics included age (years); weight (kg); sex (male vs. female); ethnicity (white vs. other); current smoking status (yes vs. no); history of heart failure (yes vs. no), hypertension (yes vs. no), diabetes mellitus (yes vs. no), stroke (yes vs. no); previous warfarin use (yes vs. no); current amiodarone use (yes vs. no); and current insulin use (yes vs. no).Center-level characteristics included specialty (anticoagulation clinic vs. other), setting (secondary/tertiary hospital vs. primary), and center-level mean algorithm-consistent warfarin dosing (%).Country-level characteristics included the 2006 country Gross Domestic Product (high income vs. medium/low income), Disability Adjusted Life Expectancy (years), and Health System Performance Index.We also included random intercept for center and country.
Finally, to estimate the association between algorithm-consistent dosing and the composite outcome of stroke, systemic embolism, or major hemorrhage we developed multilevel, multivariable Cox proportionate hazard models.The Cox models were adjusted for the same patient-, center-, and country-level characteristics as in the previous multilevel model, including random intercepts.Additionally, four medication variables were included, which were baseline use of aspirin (yes vs. no), β-blockers (yes vs. no), ACE-inhibitors (yes vs. no), or statins (yes vs. no).Because the effect of warfarin dosing on clinical outcomes is mediated through INR control, TTR was not included as a covariate in the Cox models.Patients who experienced stroke, systemic embolism, or major hemorrhage were censored when the first adverse event happened, and algorithm-consistency was calculated using warfarin doses before their first adverse event; for patients who did not experience an event, algorithm consistency was calculated using all warfarin doses throughout the study.Algorithm-consistency was calculated for each center and was analyzed as a center-level variable.
To evaluate whether algorithm consistent dosing at the center-level was a marker of generalized high quality care, independent of its influence on warfarin control, we refit the multilevel Cox proportional hazards model on patients taking dabigatran to assess whether center-level algorithm consistency could independently predict center-level outcomes for dabigatran patients.

Ethics
This paper is a sub-study of COMBINE AF 32 , approved by the Duke University Institutional Review Board.Additionally, each of the four randomized control trials used in this study had institutional review board approval [28][29][30][31] .All participants in these trials provided informed consent, which included consent for secondary use of trial data in future research.

Results
A total of 28,232 patients with atrial fibrillation from the warfarin arms of four clinical trials of novel oral anticoagulants vs. warfarin were included in our study.The development set was made up 22,502 patients from 52 countries enrolled in the ARISTOTLE, ENGAGE AF-TIMI 48 and ROCKET AF trials.External validation set was made up of 5,730 patients from 44 countries enrolled in the RE-LY trial.All countries included in the external validation set were represented in the development set.The studies were well balanced with respect to most baseline patient characteristics (Table 1).Studies varied slightly in the distribution of continent (RE-LY had more patients from North America and Western Europe, for example) and ethnicity (RELY and ENGAGE have relatively few patients who identified as Hispanic).The most significant difference between the studies was the relatively higher rates of comorbidities in ENGAGE and ROCKET (mean CHADS2 score of 2.8 and 3.5 respectively), vs. ARISTOTLE and RE-LY (CHADS 2 scores of 2.1), which was due to differences in the number of risk factors for stroke required for inclusion in each trial (1 for RE-LY and ARISTOTLE, 2 or more for ENGAGE and ROCKET).
Policies based on observed clinician behavior, the benchmark algorithm and our RL model are visualized in Fig. 2. Observed clinician behavior suggests some clinicians tend to maintain a dose even when INR well out Vol:.( 1234567890    www.nature.com/scientificreports/ on INR thresholds, and so produces a very clean policy with no apparently unintuitive recommendations.The benchmark algorithm was designed to minimize dramatic swings in INR and so does not recommend any dose changes greater than 15%, even when INR is well out of range.The RL policy appears to be more aggressive than the others, rarely maintaining a dose when the INR is out of range and recommending dosage adjustments of > 20% relatively frequently (63% of cases where INR was < 1.5 and 20% of cases where INR was > 3.5).We conducted a manual post hoc review of cases where the RL policy appears to offer unintuitive dose adjustments when INR out of range and found that these were examples of the same temporary warfarin suspension/resumption scenarios described above.We evaluated the potential effectiveness of the RL policy to improve TTR using a weighted linear regression of the association between mean center algorithm agreement and mean center TTR.This analysis was conducted on our external validation set, which included 5730 patients from 903 centers spread across 44 countries.This analysis showed a positive association between RL-consistent dosing and TTR at the center-level (R 2 = 0.56).We carried out the same analysis for the benchmark algorithm and found a similarly strong association between algorithm consistent dosing and TTR (R 2 = 0.53) that is consistent with previous research 9 (Fig. 3).
We further explored the relationship between algorithm consistent dosing and TTR using a multilevel, multivariable linear regression, with patients nested in centers, and centers nested in countries (Table 2).After adjusting for patient, center, and country characteristics, center-level RL-consistent dosing was the strongest predictor of TTR, with each 10% increase in algorithm-consistent dosing at the center level predicting a 6.78% improvement in TTR (95% CI 6.29, 7.28, p < 0.001).We repeated this analysis for the benchmark algorithm and found comparable results, with each 10% increase in algorithm-consistent dosing predicting a 6.10% increase in TTR (95% CI 5.67, 6.54, p < 0.001).
To examine the degree to which the improvement in TTR associated with RL-consistent dosing might translate into reductions in stroke and other events, we developed a multilevel, multivariable Cox proportional hazard model to examine the relationship between center-level algorithm-consistency and a composite outcome of stroke, systemic embolism, or major hemorrhage (Table 3).We found that each 10% increase in RL-consistent dosing was associated with a 11% decrease in the composite outcome (HR 0.89; 95% CI 0.81, 0.98, p = 0.015).www.nature.com/scientificreports/ We repeated this analysis for the benchmark algorithm and found that every 10% increase in agreement was associated with an 10% decrease in the composite outcome (HR 0.90; 95% CI 0.83, 0.98, p = 0.018).The association between algorithmically consistent dosing and the composite clinical outcome suggests a potential causal relationship, but a lower event rate at centers with better anticoagulation management could alternately be the result of generally higher quality care at those centers.However, if the lower composite clinical event rate were the result of higher quality of care at the center-level, then we would expect to see a similarly lower rate of events among dabigatran patients at the same centers.To assess this, we refit the Cox model for dabigatran patients in the evaluation set (N = 9467), but found no association between center-level algorithm consistent dosing and the composite clinical event rate for patients on dabigatran (HR 0.96; 95% CI 0.90, 1.02, p = 0.16).
To assess whether the inclusion of variables associated with warfarin metabolism provided any improvement in performance over a more strict focus on INR and previous dose, we used the same multilevel, multivariable regression models to evaluate a second RL model that included clinical features such as age, ethnicity, and concurrent medications in addition to INR and previous warfarin doses.There was very little difference in the performance of these two models, with each 10% increase in algorithm-consistent dosing predicting a 6.78% improvement in TTR in the case of the INR/dose only model and 6.49% in the case of the model that included the additional clinical features.The difference was slightly larger in the case of events, with each 10% increase in algorithm-consistent dosing associated with a 10% decrease in the composite outcome for the INR/dose only model and 7% for the model that included the additional clinical features.

Discussion
In this study of patients with atrial fibrillation receiving vitamin K antagonist therapy, we demonstrate that a RL model could achieve TTR improvement and event reduction equivalent to an established rules-based warfarin dosing algorithm designed by clinical experts.This data driven model was developed using 3 international randomized control trials, and externally validated on a fourth trial of patients living in 44 countries across 6 continents.In addition to potentially improving warfarin dosing for atrial fibrillation (still being used commonly around the world despite the rise of novel oral anticoagulants) the methodology described in this paper may have important applications in other clinical domains, including dosing of warfarin for other indications (such as patients with mechanical valves) as well as optimization of other medications where dosing is based on sequential dose-response relationships, such as heparin, insulin, oral antihyperglycemic drugs, and phenytoin.
This study helps address some of the existing knowledge gaps with respect to the use of machine learning to improve warfarin dosing.A number of machine learning-based systems have been developed, but many of these have been focused on patients within a single jurisdiction or ethnic group, which may limit their generalizability [41][42][43][44][45][46] .A number of machine learning studies have leveraged International the Warfarin Pharmacogenetics Consortium dataset of 5700 patients from 21 countries, however many of these have incorporated genetic markers associated with warfarin metabolism [46][47][48][49][50] , which may offer valuable performance improvements, but cannot be easily integrated into routine clinical practice (while single nucleotide polymorphisms CYP2C9 and VKORC1 were available for a subset of our patients, we elected not to use them in order to enhance the generalizability of our model for real world clinical settings, especially low and middle income countries where warfarin is still routinely used).Ours is also the first study we are aware of to evaluate the use of machine learning for the long-term dynamic management of warfarin in the community, versus warfarin initiation or short-term management during a hospital stay [42][43][44][45][46][47] .Our study is the largest its kind by a considerable margin, which is relevant given the data-hungry nature of machine learning methods.
Our study also reveals that clinical features related to warfarin metabolism do not appear to provide sufficient information to justify their inclusion in a model.While one of the strengths of a machine learning approach is to be able to incorporate high dimensional data, in this case the established clinical approach of including only INR and previous warfarin doses to drive decision making produced the best performing model.While are unable to say definitively what was responsible for this phenomenon, given the nature of neural networks it is likely that it is due to a form of overfitting.In particular, the personalized model may have been learning spurious correlations between demographic features and INR that do not generalize between our development data set and the external validation data set (Table S1).
A common limitation of most reinforcement learning research in the health domain has been the lack of rigorous statistical evaluation.Unlike supervised learning approaches that can be evaluated using standard techniques for classification and regression, evaluation of RL models on observational data have typically been limited to importance sampling or ad hoc methods (such as U-curves), both of which have limitations that can produce performance estimates that are overly-optimistic 19,38,39 .We were able to address this limitation by taking advantage of the multicenter structure of our data to apply statistically robust techniques developed in a previous study of warfarin dosing algorithms 9,40 .This approach allowed us to provide a rigorous estimate the relationship between algorithmically consistent dosing, TTR, and patient outcomes.
Our study has a number of important advantages over some other reinforcement learning applications in health care, owing to the structure of the warfarin dosing problem.In many clinical domains where RL has been applied, such as sepsis, data sets have not always contained all the data relevant to clinical decision making, which can lead to estimates confounded by spurious correlation 38 .In addition, RL studies in domains such as sepsis can suffer from a sparsity of rewards, where events of interest can be so infrequent that it becomes challenging to assess the value of an individual treatment decision 39 .In our case, however, the data set contained all the relevant data points used to manage warfarin in routine clinical practice.Moreover, the causal connection between INR and clinically relevant outcomes is well-established, which allowed us to use INR results to reward (or not) every decision the model observed during training.
Finally, it is important to note that while our study establishes that RL can be used to learn an optimal policy for warfarin dosing, it provides a relatively small improvement over that of a rules-based clinical algorithm.While our study suggests promise for the future of artificial intelligence to learn directly from clinical data to enhance care, it is important to recognize just how effective a well-researched, interpretable rule-based algorithm developed by clinical experts can be in supporting better clinical practice.

Limitations
Our study is retrospective, so prospective evaluation through a randomized control trial should be undertaken prior to adoption of our model into routine clinical practice.Because our study is based on data from the setting of clinical trials with strict inclusion/exclusion criteria and selected centers, our results may underestimate real-world variation in TTR between centers 51,52 as well as the association between warfarin dosing and clinical outcomes.While our evaluation methods account for some social factors such as country-level GDP, we were not able to consider factors such as income or education level, which can include warfarin dose variability, TTR and outcomes.Our methodology employs deep neural networks, whose inner workings are black boxes -while explainable machine learning techniques may go some way to increasing the clinical acceptability of black box models, these techniques have some limitations of their own 53 .

Summary
In summary, we trained a deep reinforcement learning model on longitudinal data from 22,502 patients on warfarin, and externally validated this model on a further 5730 warfarin patients from 44 countries across 6 continents.Our findings suggest that a machine learning algorithm is capable of optimizing time in therapeutic range for patients taking warfarin.A digital clinical decision support system to promote algorithm-consistent warfarin dosing could optimize time in therapeutic range and improve clinical outcomes in atrial fibrillation globally.

Figure 1 .
Figure 1.Warfarin dosing optimization as a sequential reinforcement learning task using semi-Markov Decision Processes.The figure illustrates the first several time intervals of a patient trajectory.The highlighted values in the top part of the figure illustrate a single patient trajectory.Each observed time step (T = 0, T = 1, etc.) has a corresponding International Normalized Ratio (INR) test result.The action space is illustrated in purple and is defined as percent changes in warfarin dose-the arrows illustrate that the effect of the change in dose is observed at the following time step.The dynamic elements of the state space are INR test results.We employ a semi-Markov Decision Process framework to handle the inconsistent time intervals between observations inherent to warfarin management, illustrated in the bottom part of the figure.Linear interpolation (Rosendaal's method) is used to generate INR values for intermediate unobserved time steps.We then calculate intermediate rewards based on interpolated INR values (rewarded when INR is in the range of 2-3) and apply a cumulative discounted rewards to each observed time step.The cumulative discounted reward for an observed time step is applied to the action taken at the previous observed time step (e.g., the dosing action observed at T = 1 is rewarded based on the INR value observed at T = 2).We do not model warfarin initiation, so there is no reward function at T = 0. Observed time step are displayed in darker shades; unobserved interpolated time steps are displayed in lighter shades.
π(s) = argmax o∈O(s) Q(s, o) Vol.:(0123456789) Scientific Reports | (2024) 14:4516 | https://doi.org/10.1038/s41598-024-55110-9 4) Continued of range (clinicians maintained the dose 32% of the time when INR was < 1.5 and 25% of the time when INR was > 3.5), while others will make relatively large dose adjustments in the same circumstances (36% of the time clinicians increased the dose by > 20% when INR was < 1.5 and 28% of the time decreased the dose by > 20% when the INR was > 3.5).Observed clinical behavior appears to include examples of clinically unintuitive behavior, such as cases where clinicians decrease the dose dramatically when INR was < 1.5 and increased it substantially when INR was > 3.5.Upon reviewing these cases, we found them to be clinical scenarios in which a clinician had temporarily suspended and then resumed a patient's warfarin dose in response to an extreme INR or a clinical event such as an emergency surgery.The benchmark algorithm provides dosing recommendations based purely Table 1.Baseline patient characteristics, adverse event rates and study characteristics by trial.Continuous variables listed as mean (standard deviation).Categorical variables listed as %.Likely erroneous values/ outliers (described in "methods" section) were not included in calculations for weight, body mass index, or creatinine clearance.BMI = body mass index, CABG = coronary artery bypass graft, CAD = coronary artery disease, MI = myocardial infarction, NSAID = non-steroidal anti-inflammatory, PCI = percutaneous coronary intervention, PPI = proton pump inhibitor, TIA = transient ischemic attack, VKA = vitamin K antagonist.

Figure 2 .
Figure 2. Comparison of clinician, reinforcement learning (RL), and benchmark policies.The figure illustrates clinician, benchmark algorithm and RL policies using heatmaps, where the X axis illustrates INR result bins, and the Y axis illustrates the distribution of dose recommendations within each INR result bin.On average, the RL policy recommends larger dose changes in the context of very high and very low INR values than either the clinician or benchmark policies.

Figure 3 .
Figure 3. Weighted linear regression of the association between mean center algorithm-consistency and mean center time in therapeutic range (TTR).Mean center algorithm-consistent dosing and TTR were calculated by averaging the values obtained from patients in each center.The regression model was weighted by the number of patients per center.Each data point represents a single center, and the size of the data point represents the number of patients in that center.

Table 2 .
Multilevel multivariable linear regression model for patient-level time in therapeutic range.

Table 3 .
Multi-level multivariable cox proportional hazard model for stroke, systemic embolism, or major hemorrhage.